The following loads the packages and data needed to run the rest of the code.

library(tidyverse)
model_variables = read.csv('data/model_variables_anonymized.csv')

Machine Learning

We'll use the caret package to employ standard machine learning techniques in R.

Start with some pre-processing of the data

library(caret) # need to install?
set.seed(1234) # so that the indices will be the same when re-run
trainIndices = createDataPartition(model_variables$libuser, p=.8, list=F)

X_train = model_variables %>% 
  slice(trainIndices)

X_test = model_variables %>% 
  slice(-trainIndices)
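For comparison, what createDataPartition does here, a stratified split that preserves the outcome's class proportions in both pieces, can be sketched in Python with scikit-learn's train_test_split (the data frame below is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up data standing in for model_variables; the split logic is the point
df = pd.DataFrame({
    'x1': range(100),
    'libuser': ['yes', 'no'] * 50,
})

# stratify keeps the outcome proportions in both pieces, like createDataPartition
train, test = train_test_split(df, train_size=.8, stratify=df['libuser'], random_state=1234)

print(len(train), len(test))  # 80 20
```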

Example with XGBoost

library(xgboost)  # need to install?

xgb_opts = expand.grid(
  eta = c(.3, .4),
  max_depth = c(9, 12),
  colsample_bytree = c(.6, .8),
  subsample = c(.5, .75, 1),
  nrounds = 100, # 1000 would be more reasonable, but notably time consuming
  min_child_weight = 1,
  gamma = 0
)
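The expand.grid call enumerates every combination of the tuning values (2 × 2 × 2 × 3 = 24 candidate models, each of which will be cross-validated). A quick Python sketch with sklearn's ParameterGrid confirms the count:

```python
from sklearn.model_selection import ParameterGrid

# the same tuning values as the expand.grid call above
xgb_opts = {
    'eta': [.3, .4],
    'max_depth': [9, 12],
    'colsample_bytree': [.6, .8],
    'subsample': [.5, .75, 1],
    'nrounds': [100],
    'min_child_weight': [1],
    'gamma': [0],
}

grid = list(ParameterGrid(xgb_opts))
print(len(grid))  # 2 * 2 * 2 * 3 = 24
```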

cv_opts = trainControl(method='cv', number=10)

Run in parallel

# for parallel processing
library(doParallel)  # need to install?
cl = makeCluster(detectCores() - 1)
registerDoParallel(cl)

results_xgb = train(
  libuser ~ .,
  data = X_train,
  method = 'xgbTree',
  preProcess = c('center', 'scale'),
  trControl = cv_opts,
  tuneGrid = xgb_opts
)

stopCluster(cl)

results_xgb


preds_gb = predict(results_xgb, X_test)
confusionMatrix(preds_gb, X_test$libuser, positive='yes')
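Looking ahead to the Python section, caret's confusionMatrix roughly corresponds to sklearn's confusion_matrix plus derived statistics such as accuracy; a toy sketch with made-up labels standing in for preds_gb and X_test$libuser:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# made-up predictions and truth for illustration
y_true = ['yes', 'yes', 'no', 'no', 'yes', 'no']
y_pred = ['yes', 'no',  'no', 'no', 'yes', 'yes']

cm = confusion_matrix(y_true, y_pred, labels=['no', 'yes'])
print(cm)  # rows are truth, columns are predictions
print(accuracy_score(y_true, y_pred))  # 4 of 6 correct
```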

Python

With machine learning, we finally reach a point where Python is on par with R, and typically surpasses it.

Most techniques that fall under the heading of machine learning are developed first in Python.

For at least some techniques, Python will also run faster, sometimes notably so, though this depends on many factors.

Init

# note how when using something other than R, you have to specify the engine path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


model_variables = pd.read_csv('data/model_variables_anonymized.csv')

# create the feature/target split and the training/test sets used below
X = model_variables.drop(columns='libuser')
y = model_variables['libuser']
X_train, X_test, y_train, y_test = train_test_split(
  X, y, train_size=.8, stratify=y, random_state=1234)

Random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(n_estimators=1000)  # number of trees

rf_opts = {'max_features': np.arange(2, 7)}  # tuning parameter
rf_estimator = GridSearchCV(rf, cv=10, param_grid=rf_opts, n_jobs=4)  # 10-fold cv, 4 parallel jobs
results_rf = rf_estimator.fit(X_train, y_train)  # takes data and target separately, unlike caret's formula interface

Inspect the best result over the tuning parameters

results_rf.best_score_
results_rf.best_params_
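best_score_ is the mean cross-validated score for the winning parameter combination, and best_params_ is that combination. A self-contained sketch on synthetic data (make_classification is just a stand-in for the real data, and the settings are trimmed for speed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data: 200 observations, 6 features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
search = GridSearchCV(rf, param_grid={'max_features': np.arange(2, 5)}, cv=5)
search.fit(X, y)

print(search.best_params_)               # the winning combination, e.g. {'max_features': 2}
print(round(search.best_score_, 3))      # its mean cross-validated accuracy
```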

Test model on new data

from sklearn import metrics

rf_predict = results_rf.predict(X_test)
print(metrics.classification_report(y_test, rf_predict))
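classification_report summarizes precision, recall, and F1 per class. A toy example with made-up labels shows the shape of the output:

```python
from sklearn.metrics import classification_report

# made-up truth and predictions for illustration
y_true = ['yes', 'no', 'yes', 'no']
y_pred = ['yes', 'no', 'no', 'no']

report = classification_report(y_true, y_pred, output_dict=True)
print(report['yes']['precision'])  # 1.0: every predicted 'yes' was correct
print(report['yes']['recall'])     # 0.5: only one of two true 'yes' cases was found
```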